Two-dimensional Clustering for Text Categorization
نویسندگان
چکیده
We propose a new method to improve the accuracy of Text Categorization using twodimensional clustering. In a number of previous probabilistic approaches, texts in the same category are implicitly assumed to be generated from an identical distribution. We empirically show that this assumption is not accurate, and propose a new framework based on twodimensional clustering to alleviate this problem. In our method, training texts are clustered so that the assumption is more likely to be true, and at the same time, features are also clustered in order to tackle the data sparseness problem. We conduct some experiments to validate the proposed two-dimensional clustering method.
منابع مشابه
Kernel PCA based clustering for inducing features in text categorization
We study dimensionality reduction or feature selection in text document categorization problem. We focus on the first step in building text categorization systems, that is the choice of efficiently representing numerically the natural language text. This numerical representation is going to be used by machine learning algorithms. We propose a representation based on word clusters. We build a ke...
متن کاملStudy on Feature Selection Methods for Text Mining
Text mining has been employed in a wide range of applications such as text summarisation, text categorization, named entity extraction, and opinion and sentimental analysis. Text classification is the task of assigning predefined categories to free-text documents. That is, it is a supervised learning technique. While in text clustering (sometimes called document clustering) the possible categor...
متن کاملImproving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...
متن کاملA Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
متن کاملClustering on the Unit Hypersphere using von Mises-Fisher Distributions
Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a g...
متن کامل